Integration tests were failing with HTTP 403 errors when downloading the scikit-learn California housing dataset. This PR bundles the CSV in the repo so the tests no longer depend on that download.
How to test
What needs special review?
Dependencies, breaking changes, and deployment notes
Release notes
Checklist
What and why
Screenshots or videos (Frontend)
How to test
What needs special review
Dependencies, breaking changes, and deployment notes
Pull requests must include at least one of the required labels: internal (no release notes required), highlight, enhancement, bug, deprecation, documentation. Except for internal, pull requests must also include a description in the release notes section.
This pull request introduces significant enhancements to the California housing dataset module. The changes primarily focus on improving the data loading mechanism by:
Introducing a new API in the load_data function that supports a 'bundled' data source as an alternative to the default sklearn fetch. The function will first attempt to load the dataset from a bundled CSV file. If the file is absent or the columns do not match the expected ones, it falls back to fetching the data using sklearn.
Adding robust error handling in the helper function _load_from_sklearn to capture common issues such as HTTP 403 errors or network-related problems. Detailed error messages are provided to guide the user on potential resolutions, including instructions for manually downloading the dataset if necessary.
Including a helper script generate_california_housing_csv.py that downloads the dataset (using the same fallback mechanisms) and saves it as a CSV file in the repository. This script assists in generating the bundled version of the dataset, ensuring that the repository can serve the dataset without always relying on an external download.
These changes aim to improve data reliability, user experience, and local caching of the dataset while providing clear diagnostic feedback when operations fail.
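The load path described above could be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the signature of `load_data`, the `csv_path` and `fallback` parameters, the `BUNDLED_CSV` location, the error message wording, and the list-of-dicts return type are all assumptions made for the example. Only the function names `load_data` and `_load_from_sklearn`, the `'bundled'` source, and the `ValueError` for an invalid source come from the summary.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, List, Optional

# Column names scikit-learn uses for this dataset (eight features plus the target).
EXPECTED_COLUMNS = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms",
    "Population", "AveOccup", "Latitude", "Longitude", "MedHouseVal",
]
# Hypothetical location of the bundled file; the PR summary does not give the real path.
BUNDLED_CSV = Path("california_housing.csv")

Rows = List[Dict[str, float]]


def _load_from_sklearn() -> Rows:
    """Fetch via scikit-learn and turn download failures into an actionable error."""
    # Imported lazily so the bundled path works without scikit-learn installed.
    from sklearn.datasets import fetch_california_housing

    try:
        frame = fetch_california_housing(as_frame=True).frame
    except OSError as exc:  # urllib's URLError and HTTPError (e.g. HTTP 403) derive from OSError
        raise RuntimeError(
            f"Could not download the California housing dataset ({exc}). "
            "Check network access, or generate the bundled CSV on a machine "
            "that can reach the download host."
        ) from exc
    return frame.to_dict(orient="records")


def _load_bundled(path: Path) -> Optional[Rows]:
    """Return the rows of the bundled CSV, or None if it is absent or has other columns."""
    if not path.is_file():
        return None
    with path.open(newline="") as fh:
        reader = csv.DictReader(fh)
        if not reader.fieldnames or set(reader.fieldnames) != set(EXPECTED_COLUMNS):
            return None
        return [{name: float(value) for name, value in row.items()} for row in reader]


def load_data(
    source: str = "bundled",
    csv_path: Path = BUNDLED_CSV,
    fallback: Callable[[], Rows] = _load_from_sklearn,
) -> Rows:
    """Prefer the bundled CSV; fall back to the scikit-learn fetch when it is unusable."""
    if source == "sklearn":
        return fallback()
    if source == "bundled":
        rows = _load_bundled(Path(csv_path))
        return rows if rows is not None else fallback()
    raise ValueError(f"Unknown source {source!r}; expected 'bundled' or 'sklearn'.")
```

Passing the fetcher in as `fallback` is a choice made for this sketch so that the fallback branch can be exercised without network access; the actual module may call `_load_from_sklearn` directly.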
Test Suggestions
Test loading data with the 'bundled' source when the CSV file exists and contains the correct columns.
Test loading data with the 'bundled' source when the CSV file is missing to ensure it falls back to the sklearn fetch.
Simulate a scenario where the bundled CSV is present but has missing or incorrect columns to validate the fallback mechanism.
Test for error conditions by providing an invalid source parameter to ensure the appropriate ValueError is raised.
Test the helper script by running it in an environment without a cached dataset to ensure it can download and generate the CSV file.
internal: Not to be externalized in the release notes
2 participants